Abstract

For consistency (and even oracle properties) of estimation and model prediction,
almost all existing methods of variable/feature selection depend critically on the
sparsity of models. However, for "large p and small n" models the sparsity assumption
is hard to check and, in particular, when this assumption is violated, consistency of
existing estimators is usually impossible, because the working models selected by
existing methods such as the LASSO and the Dantzig selector are usually biased. To
attack this problem, we propose in this paper adaptive post-Dantzig estimation and
model prediction. Here adaptability means that the consistency of the newly proposed
method is adaptive to non-sparsity of the model, the choice of the shrinkage tuning
parameter and the dimension of the predictor vector. The idea is that, after a
sub-model is determined as a working model by the Dantzig selector, we construct a
globally unbiased sub-model by choosing suitable instrumental variables and a
nonparametric adjustment. The new estimator of the parameters in the sub-model is
asymptotically normal. The consistent estimator, together with the selected sub-model
and the adjusted model, improves model prediction. Simulation studies show that the
new approach significantly improves estimation and prediction accuracy over the
Gaussian Dantzig selector and other classical methods.



*Lu Lin is a professor in the School of Mathematics at Shandong University, Jinan, China.
His research was supported by NNSF project (10771123) of China, NBRP (973 Program
2007CB814901) of China, RFDP (20070422034) of China, and NSF projects (Y2006A13 and Q2007A05)
of Shandong Province of China. Lixing Zhu is a chair professor in the Department of Mathematics at
Hong Kong Baptist University, Hong Kong, China. Email: lzhu@hkbu.edu.hk. He was supported
by a grant from the University Grants Council of Hong Kong, Hong Kong, China. Yujie Gai is a
PhD student in the School of Mathematics at Shandong University, Jinan, China. The first two
authors are in charge of the methodology development and material organization.






Keywords. Adaptability, bias correction, Dantzig selector, instrumental
variable, nonparametric adjustment, ultra-high-dimensional regression.

AMS 2001 subject classification: 62C05, 62F10, 62F12, 62G05. 

Running head. Adaptive post-Dantzig inference. 






1. Introduction 



Estimation consistency is a natural criterion for estimation accuracy. In classical
settings with a small or moderate number of variables in the model, this criterion can
be adopted. High-dimensional models in which the number p of variables is even larger
than the sample size n are called "large p, small n" models. In this paradigm, however,
estimation consistency becomes a very challenging issue. This is because what we can
work on is only a working model, rather than the full model, after active variables are
selected into it. For variable selection, some classical and newly proposed methods are
available, such as the LASSO (including the adaptive LASSO) and the SCAD. These methods
provide consistent and asymptotically normal estimators of the parameters in working
models, but these properties depend heavily on a sparse structure, a proper choice of
the shrinkage tuning parameter and the diverging rate of the dimension of the parameter
vector. For the relevant references see Huber (1973), Portnoy (1988), Bai and
Saranadasa (1996), Fan and Peng (2004), Fan, Peng and Huang (2005), Lam and Fan (2008),
Huang et al. (2008), and Li, Zhu and Lin (2009), among others. As such, for models
without a sparse structure, how to construct a consistent estimator is a great
challenge. New or extended statistical methodologies and theories are required to
handle this challenge; see for example Donoho (2000) and Kettenring, Lindsay and
Siegmund (2003).

To this end, we further review existing methods to motivate the new methodology. The
following methods were also developed under a sparse structure. The Dantzig selector,
proposed by Candes and Tao (2007) and extended to generalized linear models by James
and Radchenko (2009), has received much attention. The connection between the Dantzig
selector and the LASSO was investigated by James et al. (2009). Under the uniform
uncertainty principle, the resulting estimator achieves an ideal risk of order
$O(\sigma\sqrt{\log p})$ with large probability. This implies that, for large p, such a
risk can nevertheless be large, and then even under a sparse structure the estimator
may be inconsistent. To reduce the risk and improve the performance of the estimation,
the Gaussian Dantzig selector, a two-stage estimation method, was suggested in the
literature (Candes and Tao 2007). Such an improved estimator is still inconsistent when
the shrinkage tuning parameter is chosen to be large (for details see the next
section). Another method is the Double Dantzig Selector (James and Radchenko 2009), by
which one may choose a more accurate model and, at the same time, get a more accurate
estimator. But it depends critically on the choice of the shrinkage tuning parameter.
Motivated by these problems, Fan and Lv (2008) introduced a sure independence screening
method, based on correlation learning, to reduce high dimensionality to a moderate
scale below the sample size. Afterwards, variable selection and parameter estimation
can be accomplished by sophisticated methods such as the LASSO, the SCAD or the Dantzig
selector. The relevant references include Kosorok and Ma (2007), Van der Laan and Bryan
(2001), Chen and Qin (2010), James, Radchenko and Lv (2009) and Kuelbs and Anand
(2010), among others.

However, for any model with very large p and without model sparsity, no existing method
can provide estimation consistency for working models, and any further data analysis
would be questionable unless we can correct the biases later; at most we can obtain an
approximation, rather than estimation consistency, as the sample size goes to infinity.
To deal with this problem, we focus our attention on the working sub-model chosen by
the Dantzig selector. In this paper, we suggest a method to construct a consistent and
asymptotically normal estimator of the parameters in the sub-model. To achieve this, a
nonparametric adjustment is recommended to construct a globally unbiased sub-model and
to correct the bias in the working model. Here the nonparametric adjustment may depend
on a low-dimensional nonparametric estimation using proper instrumental variables. We
will show the following properties. The estimator $\hat\theta$ of the parameter vector
$\theta$ in the sub-model satisfies $\|\hat\theta - \theta\|_{\ell_2}^2 = O_p(n^{-1})$
and is asymptotically normal if the dimension q of $\theta$ is fixed. Even when q tends
to infinity, consistency and asymptotic normality still hold provided q diverges at a
certain rate. We will briefly discuss the theoretical results for the case with
diverging q. Furthermore, the new consistent estimator, together with the unbiased
adjusted sub-model or the original sub-model, can also improve model prediction
accuracy. We will prove that our method possesses adaptability. That is, the above
properties always hold whether the sub-model is small or large, whether the dimension
of the parameter in the original model is high or not, and whether the original model
is sparse or not.

The rest of the paper is organized as follows. In Section 2 the properties of the
Dantzig estimator for the high-dimensional linear model are re-examined. In Section 3 a
bias-corrected sub-model is proposed by introducing instrumental variables and a
nonparametric adjustment, and a method for selecting instrumental variables is
introduced. Estimation and prediction procedures for the new sub-model are suggested,
and the asymptotic properties of the resulting estimator and prediction are obtained.
In Section 4 the algorithms for constructing instrumental variables are proposed.
Simulation studies are presented in Section 5 to examine the performance of the new
approach compared with the classical Dantzig selector and other methods. The technical
proofs of the theoretical results are deferred to the Appendix.

2. A brief review of the Dantzig selector

Consider the model

$$Y = \beta' X + \varepsilon, \qquad (2.1)$$

where Y is the scalar response, X is the p-dimensional covariate and $\varepsilon$ is
the random error satisfying $E(\varepsilon|X) = 0$ and $Cov(\varepsilon|X) = \sigma^2$.
Here p may be greater than n when we can collect a sample of size n. Throughout this
paper, our primary interest is to construct consistent estimators for the significant
components of the parameter vector
$\beta = (\beta_1, \cdots, \beta_p)' \in \mathcal{B} \subset R^p$. These significant
components of $\beta$, together with the corresponding covariates, compose a working
model. The second interest of our paper is to obtain reasonable model prediction via
our estimation.

To introduce the new estimation, we first re-examine the Dantzig selector. Let
$\mathbf{Y} = (Y_1, \cdots, Y_n)'$ be the vector of the observed responses and
$\mathbf{X} = (X_1, \cdots, X_n)' = (x_1, \cdots, x_p)$ be the $n \times p$ matrix of
the observed covariates. The Dantzig selector of $\beta$ is defined as

$$\hat\beta^D = \arg\min_{\beta \in \mathcal{B}} \|\beta\|_{\ell_1} \quad \text{subject to} \quad \sup_{1 \le j \le p} |x_j' r| \le \lambda_p \sigma \qquad (2.2)$$

for some $\lambda_p > 0$, where $\|\beta\|_{\ell_1} = \sum_{j=1}^p |\beta_j|$ and
$r = \mathbf{Y} - \mathbf{X}\beta$. As was shown by Candes and Tao (2007), under some
regularity conditions, this estimator satisfies, with large probability,

$$\|\hat\beta^D - \beta\|_{\ell_2}^2 \le C \sigma^2 \log p, \qquad (2.3)$$

where C is free of p and
$\|\hat\beta^D - \beta\|_{\ell_2}^2 = \sum_{j=1}^p |\hat\beta_j^D - \beta_j|^2$. This is
an ideal risk and thus cannot be improved in a certain sense. However, such a risk can
become large and may not be negligible when the dimension p > n.
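As an illustration of the definition (2.2), the following is a minimal sketch that casts the Dantzig selector as a linear program by writing $\beta = u - v$ with $u, v \ge 0$. It assumes SciPy's `linprog` with the HiGHS solver; the function name `dantzig_selector` and the argument `lam_sigma` (standing for the product $\lambda_p\sigma$) are hypothetical names introduced here, not part of the original paper.

```python
import numpy as np
from scipy.optimize import linprog


def dantzig_selector(X, y, lam_sigma):
    """Solve min ||b||_1 subject to max_j |x_j'(y - X b)| <= lam_sigma as an LP."""
    n, p = X.shape
    G = X.T @ X          # Gram matrix X'X
    c = X.T @ y          # X'y
    # Split b = u - v with u, v >= 0 so that ||b||_1 becomes the linear cost 1'u + 1'v.
    cost = np.ones(2 * p)
    # The two-sided constraint -lam <= c - G b <= lam written as A_ub @ [u; v] <= b_ub.
    A_ub = np.vstack([np.hstack([G, -G]), np.hstack([-G, G])])
    b_ub = np.concatenate([lam_sigma + c, lam_sigma - c])
    res = linprog(cost, A_ub=A_ub, b_ub=b_ub, bounds=(0, None), method="highs")
    if not res.success:
        raise RuntimeError(res.message)
    return res.x[:p] - res.x[p:]


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    X = rng.standard_normal((30, 60))           # p > n, as in the text
    beta = np.zeros(60)
    beta[:3] = [3.0, -2.0, 1.5]
    y = X @ beta + 0.1 * rng.standard_normal(30)
    b = dantzig_selector(X, y, lam_sigma=2.0)
    print("largest |x_j' r|:", np.max(np.abs(X.T @ (y - X @ b))))
```

The LP has 2p variables and 2p inequality constraints, so this direct formulation is only practical for moderate p; it is meant to make the constraint in (2.2) concrete rather than to be an efficient solver.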

To reduce the risk and enhance the performance in practical settings, one often
uses a two-stage selection procedure (e.g., the Gaussian Dantzig selector) to construct
a risk-reduced estimator for the obtained sub-model (Candes and Tao 2007).
For example, we can first estimate $I = \{j : \beta_j \neq 0\}$ by
$\hat I = \{j : |\hat\beta_j^D| > t\sigma\}$ for some $t > 0$ and then construct the
estimator

$$\hat\beta_{(\hat I)} = \big((\mathbf{X}^{(\hat I)})' \mathbf{X}^{(\hat I)}\big)^{-1} (\mathbf{X}^{(\hat I)})' \mathbf{Y}$$

for $\beta_{(\hat I)}$ and set the other components of $\beta$ to zero, where
$\beta_{(\hat I)}$ is the restriction of $\beta$ to the set $\hat I$, and
$\mathbf{X}^{(\hat I)}$ is the matrix with the column vectors according to $\hat I$.
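The two-stage procedure just described can be sketched as follows: threshold a first-stage estimate, then refit by least squares on the retained columns. The function name `gaussian_dantzig_refit` and the argument `threshold` (playing the role of the cutoff on $|\hat\beta_j^D|$) are hypothetical; any first-stage estimate can be passed in as `beta_d`.

```python
import numpy as np


def gaussian_dantzig_refit(X, y, beta_d, threshold):
    """Keep indices with |beta_d[j]| > threshold, then refit least squares on them."""
    support = np.flatnonzero(np.abs(beta_d) > threshold)
    beta = np.zeros(X.shape[1])
    if support.size:
        # Least-squares fit restricted to the selected columns of X.
        coef, *_ = np.linalg.lstsq(X[:, support], y, rcond=None)
        beta[support] = coef
    # Components outside the estimated support stay at zero.
    return beta, support
```

The refit removes the shrinkage of the first stage, but it inherits any variables that the first stage left out, which is the source of the bias discussed next.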

Denote $\beta_{(\hat I)} = \theta$, a q-dimensional vector of interest. Without loss of
generality, suppose that $\beta$ can be partitioned as $\beta = (\theta', \gamma')'$
and, correspondingly, X is partitioned as $X = (Z', U')'$. Then the above two-stage
procedure implies that we can use the sub-model

$$Y = \theta' Z + \eta \qquad (2.4)$$

to replace the full model (2.1), where $\eta = \gamma' U + \varepsilon$ is regarded as
the error. Here the dimension q of $\theta$ can be either fixed or diverging with n at
a certain rate. Since the above sub-model replaces the full model (2.1), we call
$\theta$ and Z the main parts of $\beta$ and X, respectively. From (2.1) and (2.4) it
follows that $E(\eta|Z) = \gamma' E(U|Z)$. When both $\gamma \neq 0$ and
$E(U|Z) \neq 0$, the sub-model (2.4) is biased and thus the two-stage estimator
$\hat\theta_S = \hat\beta_{(\hat I)}$ is also biased. It follows that the two-stage
estimator $\hat\theta_S$ of $\theta$ is also inconsistent. Note that for any non-sparse
model, the condition $\gamma \neq 0$ always holds. Hence the above method cannot yield
consistent estimation.
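The bias of the sub-model can be seen in a toy simulation. The sketch below assumes scalar Z and U with unit variances and correlation `corr`, so that $E(\eta|Z) = \gamma\,\mathrm{corr}\,Z$ and the least-squares slope of Y on Z alone converges to $\theta + \gamma\,\mathrm{corr}$ rather than to $\theta$. The function name and the Gaussian design are illustrative assumptions, not the paper's simulation setting.

```python
import numpy as np


def submodel_ls_slope(n, theta, gamma, corr, seed=0):
    """Least-squares slope of Y on Z when the omitted covariate U is correlated with Z."""
    rng = np.random.default_rng(seed)
    z = rng.standard_normal(n)
    # U has unit variance and correlation `corr` with Z.
    u = corr * z + np.sqrt(1.0 - corr**2) * rng.standard_normal(n)
    y = theta * z + gamma * u + rng.standard_normal(n)
    # Regression through the origin; all variables have mean zero.
    return float(z @ y / (z @ z))


if __name__ == "__main__":
    print("gamma != 0:", submodel_ls_slope(200_000, theta=1.0, gamma=1.0, corr=0.5))
    print("gamma == 0:", submodel_ls_slope(200_000, theta=1.0, gamma=0.0, corr=0.5))
```

Increasing n does not remove the gap between the slope and $\theta$ when $\gamma \neq 0$, which is the inconsistency described above.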

Another method for improving the Dantzig selector is the Double Dantzig Selector, by
which a more accurate model and estimator can be expected. In the first step, the
Dantzig selector is used with a relatively large shrinkage tuning parameter
$\lambda_p$, defined above, to get a relatively accurate sub-model in the sense that
more significant variables are contained. The Dantzig selector is then used again in
the selected sub-model to obtain a relatively accurate estimator of $\theta$ via a
small $\lambda_p$ and the data (Y, Z). However, such a method cannot handle a
non-sparse model because the sub-model selected in the first step is already biased.
Note also that this method depends critically on two choices of the shrinkage tuning
parameter $\lambda_p$; for details see James and Radchenko (2009). Moreover, when
estimation consistency and normality, rather than variable selection, depend heavily on
the choice of $\lambda_p$, the method is inconvenient in practice and, more seriously,
consistency cannot in effect be assessed unless a criterion for tuning parameter
selection can be defined to ensure it. It is therefore desirable to have a new
estimation/inference method whose consistency is free of the choice of $\lambda_p$.

3. Adaptive post-Dantzig estimation and prediction 

3.1 Bias-corrected model. As was shown above, the sub-model (2.4) is usually
biased. Furthermore, this model is regarded as a non-random model after the variable
selection given by the Dantzig selector, i.e., the estimate $\hat I$ of the index set
$I$ defined in the previous section is fixed after variable selection.

It is clear that a bias correction is needed for the selected sub-model (2.4) when
we want a consistent estimator of the sub-vector
$\theta = (\theta_1, \cdots, \theta_q)'$. To this end, a new model with an instrumental
variable is established. Denote $Z^* = (Z', U^{(1)}, \cdots, U^{(d)})'$ and
$W = AZ^*$, where A is a $d \times (q + d)$ matrix whose row vectors have length 1.
Without loss of generality, $U^{(1)}, \cdots, U^{(d)}$ are supposed to be the first d
components of U, although they may be chosen as other components of U or as
pseudo-variables (artificial variables). Denote by $\lambda_M$ the maximum eigenvalue
of $UU'$ and set $V = (\alpha' U / \rho, W')'$ for some $\alpha$ to be chosen later,
where $\rho$ is a nonrandom positive number satisfying the condition
$\rho = O(\|\alpha\|_{\ell_2}\sqrt{\lambda_M})$. Choose A and
$U^{(1)}, \cdots, U^{(d)}$ such that

$$E\{(Z - E(Z|V))(Z - E(Z|V))'\} > 0. \qquad (3.1)$$

This condition can trivially hold because V contains W, which is a weighted sum of Z
and $U^{(1)}, \cdots, U^{(d)}$. The condition (3.1) can be used to guarantee the
identifiability of the following model.

Denote $g(V) = E(\eta|V)$. Now we introduce a bias-corrected version of (2.4) as

$$Y_i = \theta' Z_i + g(V_i) + \xi(V_i), \quad i = 1, \cdots, n, \qquad (3.2)$$

where $\xi(V) = \eta - g(V)$. Obviously, if $\alpha$ in $\xi$ is identical to $\gamma$
in $\eta$, this model is unbiased, i.e., $E(\xi|Z, V) = 0$; otherwise it may be biased.
This model can be regarded as a partially linear model with a linear component
$\theta' Z$ and a nonparametric component $g(V)$, and is identifiable because of the
condition (3.1). From this structure, we can see that when V does not contain the
instrumental variable W and $\alpha = \gamma$, the model goes back to the original
model (2.4), as $\xi$ is zero and $g(V)$ becomes the error term $\eta$ (if
$\varepsilon$ is ignored). This structure motivates our method: by introducing an
instrumental variable V so that $\xi$ has a zero conditional mean, we can estimate
$g(\cdot)$ to correct the bias in the original model. Although a nonparametric function
$g(v)$ is involved, it will be verified that the dimension d + 1 of the variable v is
low. Note that for V, the key is to select $\alpha$ and W properly. From the above
description, we can see that although $\alpha = \gamma$ would be a natural and good
choice, $\gamma$ is unknown and, when the dimension is large, it cannot be estimated
consistently. Taking this into account, we first consider a general $\alpha$ and
construct a bias-corrected model with a suitable W or, equivalently, a suitable
matrix A.

Denote by $l = p - q$ the dimension of $\gamma$ and let
$\lambda = (0, \gamma_2 - \frac{\gamma_1}{\alpha_1}\alpha_2, \cdots, \gamma_l - \frac{\gamma_1}{\alpha_1}\alpha_l)' / \rho$,
where $\alpha_1, \cdots, \alpha_l$ are the components of $\alpha$ and $\alpha_1$ is
supposed to be nonzero. We can ensure that, when $Z_i$ and $U_i$ satisfy

$$\lambda' E(U_i|Z_i, W_i) = \lambda' E(U_i|W_i), \qquad (3.3)$$

the model (3.2) is unbiased, i.e.,

$$E(\xi(V_i)|Z_i, V_i) = 0. \qquad (3.4)$$

The proof of (3.4) will be presented in the Appendix.

When (Z, U) is elliptically symmetrically distributed, the condition (3.3) can be
rewritten at the population level in the following form:

$$\lambda' \Sigma_{U,Z^*} A' (A \Sigma_{Z^*,Z^*} A')^{-1} A (Z^* - E(Z^*)) = \lambda' \Sigma_{U,Z^*} B' (B \Sigma_{Z^*,Z^*} B')^{-1} B (Z^* - E(Z^*)), \qquad (3.5)$$

where $\Sigma_{Z^*,Z^*} = Cov(Z^*, Z^*)$, $\Sigma_{U,Z^*} = Cov(U, Z^*)$ and

$$B = \begin{pmatrix} I_q & 0 \\ A_1 & (a_{q+1}, \cdots, a_{q+d}) \end{pmatrix},$$

$A = (A_1, a_{q+1}, \cdots, a_{q+d})$, $A_1$ is a $d \times q$ matrix and
$a_j$, $j = q + 1, \cdots, q + d$, are d-dimensional column vectors. Further, the
ellipticity condition can be slightly weakened to the following linearity condition:

$$E(U|C'Z^*) = E(U) + \Sigma_{U,Z^*} C (C' \Sigma_{Z^*,Z^*} C)^{-1} C' (Z^* - E(Z^*))$$

for some given matrix C. This linearity condition also results in (3.5). The linearity
condition has been widely assumed in the context of high-dimensional models.
Hall and Li (1993) showed that it often holds approximately when the dimension p
is high.

Under either the equation (3.3) or (3.5), the bias-corrected model (3.2) is unbiased.
Thus, we are now in a position to determine the matrix A by solving either (3.3) or
(3.5). A solution is not difficult to obtain. For example, if
$\Sigma_{Z^*,Z^*} = I_{q+d}$ and $B^{-1}$ exists, then we choose A satisfying

$$\Sigma_{U,Z^*} A' (AA')^{-1} A (Z^* - E(Z^*)) = \Sigma_{U,Z^*} (Z^* - E(Z^*)). \qquad (3.6)$$

It is known that, if we can choose variables $U^{(1)}, \cdots, U^{(d)}$ such that the
rank of the matrix $\Sigma_{U,Z^*}$ is d, then

$$\Sigma^+_{U,Z^*} \Sigma_{U,Z^*} = Q \begin{pmatrix} I_d & 0 \\ 0 & 0 \end{pmatrix} Q',$$

where $\Sigma^+_{U,Z^*}$ is the Moore-Penrose generalized inverse of $\Sigma_{U,Z^*}$,
$I_d$ is the $d \times d$ identity matrix, and $Q = (Q_1, Q_2)$ is an orthogonal matrix
satisfying $Q'Q = I_{q+d}$ and $Q_1'Q_1 = I_d$. In this case, we choose

$$A = Q_1'. \qquad (3.7)$$

Such a matrix A is a solution of (3.6) and thus a solution of (3.5). With such a
choice of A, the model (3.2) is always unbiased whether the model (2.1) is sparse or
not, whether the dimension of $\beta$ is high or low, and whether the choice of
$\lambda_p$ is proper or not.

However, sometimes the matrix $\Sigma^+_{U,Z^*} \Sigma_{U,Z^*}$ is unknown. For this
situation, we will present a detailed procedure in Section 4 to calculate
$\Sigma^+_{U,Z^*} \Sigma_{U,Z^*}$ and A. From the above choice of A, we can see that
$g(v)$ is a (d + 1)-dimensional nonparametric function. If d is large, we choose a row
vector to replace A and will give a method in Section 4 to find an approximate
solution, with which $g(v)$ is a 2-dimensional nonparametric function.

The above deduction shows that the bias-correction procedure is free of the choice of
$\alpha$. However, choosing a proper $\alpha$ is important. It is clear from (3.2) and
(3.3) that choosing an $\alpha$ as close to $\gamma$ as possible should be a good
approach, although the optimal choice remains an unsolved and interesting problem. In
the estimation procedure, a natural choice is the value $\hat\gamma^D$ for $\gamma$
obtained in the Dantzig selection step. The details are presented in Subsection 3.2
below. We will also discuss the asymptotic properties of the estimator when a given
$\alpha$ is used in the next subsection.
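The choice (3.7) can be checked numerically. The sketch below assumes $\Sigma_{U,Z^*}$ is known and has rank d, and obtains A from the singular value decomposition: the d leading right singular vectors span the row space of $\Sigma_{U,Z^*}$, so $A'A$ equals the projector $\Sigma^+_{U,Z^*}\Sigma_{U,Z^*} = Q_1Q_1'$ and A agrees with $Q_1'$ up to an orthogonal rotation of its rows. The function name `instrument_matrix` and the rank tolerance are hypothetical choices made for this illustration.

```python
import numpy as np


def instrument_matrix(sigma_uz, d):
    """Return a d x (q+d) matrix A with orthonormal rows spanning the row space of sigma_uz."""
    _, sing, vt = np.linalg.svd(sigma_uz, full_matrices=False)
    if sing.size < d or sing[d - 1] <= 1e-10 * sing[0]:
        raise ValueError("sigma_uz must have rank at least d")
    # Rows of vt are right singular vectors; the first d span the row space.
    return vt[:d]


if __name__ == "__main__":
    rng = np.random.default_rng(3)
    l, q, d = 8, 3, 2
    sigma = rng.standard_normal((l, d)) @ rng.standard_normal((d, q + d))  # rank d
    A = instrument_matrix(sigma, d)
    proj = A.T @ np.linalg.inv(A @ A.T) @ A
    print("max deviation in (3.6):", np.max(np.abs(sigma @ proj - sigma)))
```

Because the rows of A are orthonormal, the requirement that each row of A has length 1 holds automatically.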
3.2 Asymptotic normality of estimation. Throughout this subsection we assume
that a matrix A satisfying (3.5) or (3.6) has been obtained. Although the obtained A is
sometimes an estimator rather than an exact solution, in this section we still regard
it as a nonrandom solution of (3.5) or (3.6), because such an estimator is
$\sqrt{n}$-consistent (see Section 4 below) and, as a result, when A is regarded as
random, the theoretical conclusions given below still hold.

Recall that the bias-corrected model (3.2) can be thought of as a partially linear
model. We therefore design an estimation procedure as follows. First of all, as
mentioned above, for any $\alpha$, the model (3.2) is unbiased. Then we can design the
estimation procedure after $\alpha$ is determined by any empirical method. An empirical
choice of $\alpha$ is the Dantzig selector $\hat\gamma^D$ of $\gamma$ determined by
(2.2). Generally, given $\theta$ and for any $\alpha$, the nonparametric function
$g(v)$ is estimated by

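$$\hat g_\theta(v) = \frac{\sum_{k=1}^n (Y_k - \theta' Z_k) L_H(V_k - v)}{\sum_{k=1}^n L_H(V_k - v)},$$

where $L_H(\cdot)$ is a (d + 1)-dimensional kernel function. Then $\hat g_\theta(v)$ is
a (d + 1)-variate nonparametric estimator. As was shown above, the dimension d + 1 is
low. A simple choice of $L_H(\cdot)$ is the product kernel

$$L_H(V - v) = \frac{1}{h^{d+1}} K\Big(\frac{V^{(1)} - v^{(1)}}{h}\Big) \cdots K\Big(\frac{V^{(d+1)} - v^{(d+1)}}{h}\Big),$$

where $V^{(j)}$, $j = 1, \cdots, d + 1$, are the components of V, $K(\cdot)$ is a
one-dimensional kernel function and h is the bandwidth depending on n. In particular,
when $\alpha$ is chosen as $\hat\gamma^D$, we get an estimator of $g(v)$ as

$$\tilde g_\theta(v) = \frac{\sum_{k=1}^n (Y_k - \theta' Z_k) L_H(\hat V_k - v)}{\sum_{k=1}^n L_H(\hat V_k - v)},$$

where $\hat V = ((\hat\gamma^D)' U / \rho, W')'$ and
$\rho = O(\|\hat\gamma^D\|_{\ell_2}\sqrt{\lambda_M})$.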
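A minimal sketch of the kernel estimator of $g(v)$ with a product kernel follows. It assumes a Gaussian choice for the one-dimensional kernel $K$, which the paper does not prescribe; the function names `product_kernel_weights` and `g_hat` are hypothetical.

```python
import numpy as np


def product_kernel_weights(V, v, h):
    """Product kernel L_H(V_k - v) for each row V_k of V, with a Gaussian K (assumed)."""
    u = (V - v) / h
    dim = V.shape[1]
    # Product over coordinates of K(u_j) / h, with K the standard normal density.
    return np.exp(-0.5 * np.sum(u**2, axis=1)) / (np.sqrt(2.0 * np.pi) * h) ** dim


def g_hat(v, V, Y, Z, theta, h):
    """Kernel-weighted average of the partial residuals Y_k - theta'Z_k at the point v."""
    w = product_kernel_weights(V, v, h)
    return float(w @ (Y - Z @ theta) / w.sum())
```

Here `V` is the n x (d+1) array of instrumental variables, `Z` the n x q design of the sub-model, and `theta` a candidate value of the linear coefficient; the estimator is a ratio, so the normalizing constant of the kernel cancels.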
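With these two estimators of $g(v)$, the bias-corrected model (3.2) can be
approximately expressed by the following two models:

$$Y_i \approx \theta' Z_i + \hat g_\theta(V_i) + \xi(V_i) \quad \text{and} \quad Y_i \approx \theta' Z_i + \tilde g_\theta(\hat V_i) + \xi(\hat V_i),$$

equivalently,

$$\hat Y_i \approx \theta' \hat Z_i + \xi(V_i) \quad \text{and} \quad \tilde Y_i \approx \theta' \tilde Z_i + \xi(\hat V_i), \qquad (3.8)$$

where

$$\hat Y_i = Y_i - \frac{\sum_{k=1}^n Y_k L_H(V_k - V_i)}{\sum_{k=1}^n L_H(V_k - V_i)}, \quad \hat Z_i = Z_i - \frac{\sum_{k=1}^n Z_k L_H(V_k - V_i)}{\sum_{k=1}^n L_H(V_k - V_i)},$$

$$\tilde Y_i = Y_i - \frac{\sum_{k=1}^n Y_k L_H(\hat V_k - \hat V_i)}{\sum_{k=1}^n L_H(\hat V_k - \hat V_i)}, \quad \tilde Z_i = Z_i - \frac{\sum_{k=1}^n Z_k L_H(\hat V_k - \hat V_i)}{\sum_{k=1}^n L_H(\hat V_k - \hat V_i)}.$$

Thus, the sub-models in (3.8) result in the estimators of $\theta$

$$\hat\theta = \hat S_n^{-1} \frac{1}{n} \sum_{i=1}^n \hat Z_i \hat Y_i \quad \text{and} \quad \tilde\theta = \tilde S_n^{-1} \frac{1}{n} \sum_{i=1}^n \tilde Z_i \tilde Y_i, \qquad (3.9)$$

where $\hat S_n = \frac{1}{n} \sum_{i=1}^n \hat Z_i \hat Z_i'$ and
$\tilde S_n = \frac{1}{n} \sum_{i=1}^n \tilde Z_i \tilde Z_i'$, respectively. Here we
assume that the bias-corrected model (3.2) is homoscedastic, that is,
$Var(\xi(V_i)) = \sigma_\xi^2$ or $Var(\xi(\hat V_i)) = \sigma_\xi^2$ for all
$i = 1, \cdots, n$. If the model is heteroscedastic, we respectively modify the above
estimators as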

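$$\hat\theta^* = \hat S_n^{*-1} \frac{1}{n} \sum_{i=1}^n \frac{\hat Z_i \hat Y_i}{\sigma_\xi^2(V_i)} \quad \text{and} \quad \tilde\theta^* = \tilde S_n^{*-1} \frac{1}{n} \sum_{i=1}^n \frac{\tilde Z_i \tilde Y_i}{\sigma_\xi^2(\hat V_i)},$$

where $\hat S_n^* = \frac{1}{n} \sum_{i=1}^n \sigma_\xi^{-2}(V_i) \hat Z_i \hat Z_i'$ or
$\tilde S_n^* = \frac{1}{n} \sum_{i=1}^n \sigma_\xi^{-2}(\hat V_i) \tilde Z_i \tilde Z_i'$,
respectively, and $\sigma_\xi^2(V_i) = Var(\xi(V_i))$ and
$\sigma_\xi^2(\hat V_i) = Var(\xi(\hat V_i))$. Here $\sigma_\xi^2(V_i)$ and
$\sigma_\xi^2(\hat V_i)$ are supposed to be known. If they are unknown, we can replace
them with consistent estimators; for details about how to estimate them see, for
example, Hardle et al. (2000). In the following we only consider the estimators defined
in (3.9). Finally, the estimators of $g(v)$ can be defined as either
$\hat g_{\hat\theta}(v)$ or $\tilde g_{\tilde\theta}(v)$.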
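The estimators in (3.9) follow the usual recipe for partially linear models: smooth Y and Z on V, subtract the smooths, and run least squares on the residuals. The sketch below illustrates this under the homoscedastic setting, with a Gaussian product kernel as an assumed choice; the function names are hypothetical.

```python
import numpy as np


def kernel_smoother_matrix(V, h):
    """Row-normalized kernel weights: entry (i, k) is L_H(V_k - V_i) / sum_k L_H(V_k - V_i)."""
    diff = (V[:, None, :] - V[None, :, :]) / h
    K = np.exp(-0.5 * np.sum(diff**2, axis=2))   # Gaussian product kernel (assumed)
    return K / K.sum(axis=1, keepdims=True)


def partially_linear_theta(Y, Z, V, h):
    """Least squares of the kernel residuals of Y on the kernel residuals of Z, as in (3.9)."""
    S = kernel_smoother_matrix(V, h)
    Y_res = Y - S @ Y      # residual responses
    Z_res = Z - S @ Z      # residual covariates
    theta, *_ = np.linalg.lstsq(Z_res, Y_res, rcond=None)
    return theta


if __name__ == "__main__":
    rng = np.random.default_rng(2)
    n = 300
    V = rng.uniform(-1, 1, size=(n, 1))
    Z = np.column_stack([V[:, 0] + rng.standard_normal(n),
                         V[:, 0] ** 2 + rng.standard_normal(n)])
    Y = Z @ np.array([1.0, -2.0]) + np.sin(2 * V[:, 0]) + 0.1 * rng.standard_normal(n)
    print(partially_linear_theta(Y, Z, V, h=0.2))
```

The smoother matrix is n x n, so this direct form suits moderate sample sizes; passing the estimated instrument $\hat V$ in place of V gives the second estimator in (3.9).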
To study the consistency of the estimators, the following conditions for the model
(3.2) are assumed:

(C1) The first two derivatives of $g(v)$ and $\xi(v)$ are continuous.

(C2) The kernel function $K(\cdot)$ satisfies

$$\int K(u)du = 1, \quad \int u^j K(u)du = 0, \; j = 1, \cdots, k - 1, \quad 0 < \int u^k K(u)du < \infty.$$

(C3) $nh^{2(d+1)} \to \infty$.

The conditions (C1)-(C3) are commonly used for semiparametric models. Under these
conditions, the following theorem provides the consistency of the bias-corrected
estimator $\hat\theta$.

Theorem 3.1 Assume that the conditions (C1)-(C3) hold and that, for given $\alpha$,
(3.1) and (3.3) are satisfied. When q is fixed, and p may be larger than n, then, as
$n \to \infty$,

$$\sqrt{n}(\hat\theta - \theta) \stackrel{D}{\longrightarrow} N(0, \sigma_\xi^2 S^{-1}),$$

where $S = E\{(Z - E(Z|V))(Z - E(Z|V))'\}$.

Remark 3.1 For simplicity of presentation, in this theorem we only give the
asymptotic normality for the case with fixed q. In fact, when q tends to infinity
at a certain rate, the asymptotic normality still holds for every component of
$\hat\theta$ (see for example Lam and Fan, 2008). This is because, after bias
correction, the model (3.2) is indeed a partially linear model, and then the proof is
similar but with more technical and tedious details. The proof of this theorem is
deferred to the Appendix. The results in the theorem show that the new estimator
$\hat\theta$ is $\sqrt{n}$-consistent regardless of the choice of the shrinkage tuning
parameter $\lambda_p$, and thus it is convenient to use in practice. Furthermore, by
the theorem and commonly used nonparametric techniques, we can prove that
$\hat g_{\hat\theta}(v)$ is also consistent. In effect, we can obtain strong
consistency and consistency of the mean squared error under some stronger conditions.
The details are omitted in this paper.

To investigate the asymptotic properties of the second estimator $\tilde\theta$ in
(3.9), which is based on the Dantzig selector $\hat\gamma^D$, we need the following
additional conditions:

(C4) The bandwidth h is optimally chosen, i.e., $h = O(n^{-1/(2(k+d+1))})$.

(C5) Suppose that there exists a vector, say $\alpha$, such that
$\|\alpha\|_{\ell_2} > c$ for a positive constant c and
$\|\hat\gamma^D - \alpha\|_{\ell_2} / \|\hat\gamma^D\|_{\ell_2} = O_p(n^{-\mu})$ for
some $\mu$ satisfying

$$1/2 - k/(2(k + d + 1)) < \mu < 1/2.$$

As was stated in the previous sections, $\alpha$ was an arbitrary vector. The vector
$\alpha$ in the condition (C5) is then different, but for simplicity of presentation we
use the same notation $\alpha$ in both places. The condition (C5) is the key for the
following theorem and corollary. This condition does not mean that the Dantzig selector
$\hat\gamma^D$ is consistent. Note that $\|\hat\gamma^D\|_{\ell_2}$ is large in the
non-sparse case, and the accuracy of the solution of the linear program can guarantee
that $\|\hat\gamma^D - \alpha\|_{\ell_2}$ is relatively small, where $\alpha$ is the
true solution of the linear program (2.2) at the population level (see for example
Malgouyres and Zeng, 2009). These facts show that the condition (C5) is reasonable.
Both (C4) and (C5) can actually be weakened, but for simplicity of the technical proofs
and presentation, we keep the current conditions in this paper.

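Theorem 3.2 Under the conditions (C1)-(C5), (3.1) and (3.3), we have the
following asymptotic representation for the second estimator in (3.9):

$$\sqrt{n}(\tilde\theta - \theta) = S^{-1} \frac{1}{\sqrt{n}} \sum_{i=1}^n \big(\tilde Z_i \tilde g(V_i) + \tilde Z_i \tilde\xi(V_i)\big) + o_p(1),$$

where $S = E\{(Z - E(Z|V))(Z - E(Z|V))'\}$ and

$$\tilde g(V_i) = g(V_i) - \frac{\sum_{k=1}^n g(V_k) L_H(V_k - V_i)}{\sum_{k=1}^n L_H(V_k - V_i)},$$

$$\tilde\xi(V_i) = \xi(V_i) - \frac{\sum_{k=1}^n \xi(V_k) L_H(V_k - V_i)}{\sum_{k=1}^n L_H(V_k - V_i)},$$

$$\tilde Z_i = Z_i - \frac{\sum_{k=1}^n Z_k L_H(V_k - V_i)}{\sum_{k=1}^n L_H(V_k - V_i)}.$$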

The proof of the theorem is given in the Appendix. From Theorem 3.2 and
Theorem 2.1.2 of Hardle et al. (2000), the asymptotic normality follows directly.
The following corollary states the details.

Corollary 3.3 Under the conditions of Theorem 3.1, when q is fixed but p may
be larger than n, then, as $n \to \infty$,

$$\sqrt{n}(\tilde\theta - \theta) \stackrel{D}{\longrightarrow} N(0, \sigma_\xi^2 S^{-1}).$$

As mentioned in Remark 3.1, for the sub-model with diverging q, the asymptotic
normality can still hold under some stronger conditions; the details are omitted
here.

3.3 Prediction. Combining the consistent estimator with the adjusted sub-model
(3.2), we obtain an improved prediction

$$\hat Y = \hat\theta' Z + \hat g_{\hat\theta}(V) \qquad (3.10)$$

and the corresponding prediction error is

$$E(Y - \hat Y)^2 = E((\theta - \hat\theta)' Z)^2 + E(g(V) - \hat g_{\hat\theta}(V))^2 + E(\xi^2(V))$$
$$\quad + 2E((\theta - \hat\theta)' Z (g(V) - \hat g_{\hat\theta}(V))) + 2E((\theta - \hat\theta)' Z \xi(V))$$
$$\quad + 2E((g(V) - \hat g_{\hat\theta}(V)) \xi(V))$$
$$= E(\xi^2(V)) + o(1).$$

Such a prediction has a smaller prediction error than the one given by the classical
Dantzig selector and, interestingly, it does not need any high-dimensional
nonparametric estimation.

In contrast, if we use the new estimator $\hat\theta$ and the sub-model (2.4), rather
than the adjusted sub-model (3.2), to construct the prediction, the resulting
prediction is defined as

$$\hat Y_S = \hat\theta' Z + \bar g, \qquad (3.11)$$

where

$$\bar g = \frac{1}{n} \sum_{i=1}^n \hat g_{\hat\theta}(V_i).$$

For prediction, we need to add $\bar g$ in (3.11) because the sub-model (2.4) has a
bias $E(g(V))$; otherwise, the prediction error would be even larger. In this case,
$\bar g$ is free of the predictor U and the resulting prediction (3.11) only uses the
predictor Z in the sub-model (2.4). This is different from the prediction (3.10), which
depends on both the low-dimensional predictor Z and the high-dimensional predictor U.
Thus (3.11) is a sub-model based prediction. The corresponding prediction error is

$$E(Y - \hat Y_S)^2 = E((\theta - \hat\theta)' Z)^2 + E(g(V) - \bar g)^2 + E(\xi^2(V))$$
$$\quad + 2E((\theta - \hat\theta)' Z (g(V) - \bar g)) + 2E((\theta - \hat\theta)' Z \xi(V))$$
$$\quad + 2E((g(V) - \bar g) \xi(V))$$
$$= E(\xi^2(V)) + Var(g(V)) + 2E((g(V) - E(g(V))) \xi(V)) + o(1).$$

This error is usually larger than that of the prediction (3.10). But

$$|E((E(g(V)) - g(V)) \xi(V))| \le (Var(g(V)) Var(\xi(V)))^{1/2}$$

and usually the values of both $Var(g(V))$ and $Var(\xi(V))$ are small. Then such a
prediction still has a smaller prediction error than the one obtained from the
sub-model (2.4) and the common LS estimator
$\hat\theta_S = (\mathbf{Z}'\mathbf{Z})^{-1}\mathbf{Z}'\mathbf{Y}$, namely

$$\tilde Y_S = \hat\theta_S' Z, \qquad (3.12)$$

with the corresponding error

$$E(Y - \tilde Y_S)^2 = E((\theta - \hat\theta_S)' Z)^2 + E(\gamma' U)^2 + \sigma^2 + 2E((\theta - \hat\theta_S)' Z \gamma' U).$$

Because $\hat\theta_S$ does not tend to $\theta$, the values of both
$E((\theta - \hat\theta_S)' Z)^2$ and $2E((\theta - \hat\theta_S)' Z \gamma' U)$ are
large and, as a result, the prediction error is large.

The above results show that, for prediction, the new estimator can reduce the
prediction error under both the adjusted sub-model (3.2) and the original
sub-model (2.4). We will see that the simulation results in Section 5 agree with
these conclusions.

4. Calculation for A 

4.1 Calculation of A for the case with unknown $\Sigma^+_{U,Z^*}\Sigma_{U,Z^*}$. In the
previous section, we suggested a simple choice of A for the case with known
$\Sigma^+_{U,Z^*}\Sigma_{U,Z^*}$. We now introduce an approach for choosing A such that
(3.6) holds for the case with unknown $\Sigma^+_{U,Z^*}\Sigma_{U,Z^*}$. For convenience
of presentation, we here suppose $E(Z) = 0$, $E(U) = 0$ and $Cov(Z^*) = I$. In this
case, (3.6) can be rewritten as

$$\Sigma_{U,Z^*} A' (AA')^{-1} A Z^* = \Sigma_{U,Z^*} Z^*. \qquad (4.1)$$

We denote $\Sigma^+_{U,Z^*}\Sigma_{U,Z^*} = \Omega = (\omega_{ij})$ with



(j+r. 

where $Z^{(i)}$ and $U^{(j)}$ are the components of Z and U, respectively. It is known
that $\Omega$ can be decomposed as

$$\Omega = Q\, diag\{\phi_1, \cdots, \phi_d, 0, \cdots, 0\}\, Q',$$

where $\phi_k$, $k = 1, \cdots, d$, are the positive eigenvalues of $\Omega$ and Q is
an orthogonal matrix. Note that l depends on n and tends to infinity as $n \to \infty$.
To get a consistent estimator of $\Omega$, we need the following condition

#\ foralH,j,/t,. /-^ ^^-2) 

for a positive constant C, where #{S} denotes the number of elements in the set S.
We can also use some weaker conditions to replace (4.2). In fact, the conditions we
need are similar to those required for high-dimensional linear models, for example,
the weak and strong irrepresentable conditions (Zhao and Yu 2006) and the uniform
uncertainty principle (Candes and Tao 2007). Note that $\Omega$ is a low-dimensional
matrix. Then, under the condition (4.2), $\Omega$ can be $\sqrt{n}$-consistently
estimated; for example, a naive estimator of $\omega_{ij}$ for $i, j \le q$ can be
chosen as



fc=l s=l s=l s=l s=l ^ 



I 

z 

k=l s=l 

where 1{S} is the indicator function of the set S. As was shown above, we can
express $\hat\Omega$ as

$$\hat\Omega = \hat Q\, diag\{\hat\phi_1, \cdots, \hat\phi_d, 0, \cdots, 0\}\, \hat Q' \qquad (4.3)$$

and $\hat Q$ as $\hat Q = (\hat Q_1, \hat Q_2)$. Finally, the estimator of A is obtained
by

$$\hat A = \hat Q_1'.$$



4.2 Calculation of A for large d. As mentioned before, when d is large, the solution A
of (4.1) has d rows and a (d + 1)-dimensional nonparametric estimation is then
involved, which leads to inefficient estimation. Thus, we consider an approximate
solution of (4.1) that is a row vector. Without confusion, we still use the notation A
to denote this row vector. That is, we choose a row vector A such that

$$A'AZ^* = \Sigma^+_{U,Z^*}\Sigma_{U,Z^*} Z^*. \qquad (4.4)$$

The approximate solution is identical in form to the solution of (4.1) because, when A
is a row vector normalized to have norm one, $AA' = 1$. In this case, to get a
low-dimensional nonparametric function $g(v)$, we choose d = 1, i.e., $Z^*$ is a
(q + 1)-dimensional vector. As above, when A is unknown, we can construct an estimator
as follows. Denote $A = (a_1, \cdots, a_q, a_{q+1})$, $A_k = a_k A$ and
$\Sigma^+_{U,Z^*}\Sigma_{U,Z^*} = (D_1', \cdots, D_q', D_{q+1}')'$, where $D_k$,
$k = 1, \cdots, q + 1$, are (q + 1)-dimensional row vectors. Then we estimate A by
solving the following optimization problem:

$$\inf\Big\{Q(a_1, \cdots, a_{q+1}) : \sum_{k=1}^{q+1} a_k^2 = 1\Big\}, \qquad (4.5)$$

where $Q(a_1, \cdots, a_{q+1}) = \frac{1}{n} \sum_{i=1}^n \sum_{k=1}^{q+1} \|(A_k - D_k) Z_i^*\|^2$.
By the Lagrange multiplier method, we obtain the estimators of $A_k$,
$k = 1, \cdots, q + 1$, as

$$\hat A_k = \Big(D_k \frac{1}{n} \sum_{i=1}^n Z_i^* Z_i^{*\prime} + c\, e_k / 2\Big) \Big(\frac{1}{n} \sum_{i=1}^n Z_i^* Z_i^{*\prime} + cI\Big)^{-1}, \qquad (4.6)$$

where $c > 0$, which is similar to a ridge parameter, depends on n and tends to zero
as $n \to \infty$, and $e_k$ is a row vector with k-th component 1 and the others zero.
Note that the constraint $\|A\| = 1$ implies $\|A_k\| = \pm a_k$. Finally, by combining
(4.6) and this constraint we get an estimator of $a_k$ as

$$\hat a_k = \pm \|\hat A_k\|$$

and consequently the estimator of A is obtained by

$$\hat A = (\hat a_1, \cdots, \hat a_q, \hat a_{q+1}).$$

5. Simulation studies 

In this section we examine the performance of the new method by simulations.
Using the mean squared error (MSE), the model prediction error (PE) and their
standard errors (stdMSE and stdPE), we first compare the method with the Gaussian
Dantzig selector. In ultra-high dimensional scenarios, where the Dantzig selector
cannot work well, we use sure independence screening (SIS) (Fan and Lv 2008) to bring
the dimension down to a moderate size and then make the comparison with the Gaussian
Dantzig selector. As is well known, several factors have a great impact on the
performance of variable selection methods: the dimension p of the covariate X, the
correlation structure between the components of X, and the variation of the error,
which can be measured by the theoretical model R-square defined by
$R^2 = (\mathrm{Var}(Y) - \sigma^2)/\mathrm{Var}(Y)$.
To illustrate the theoretical conclusions and the performance comprehensively,
we design three experiments. The main goal of the first experiment is to examine
the effect of $R^2$, since the smaller $R^2$ is, the more difficult it is to select variables correctly.
The second experiment investigates the impact of the correlation between the
components of X, and the third checks whether the two-step procedure
of the SIS and the Dantzig selector works.
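As a small illustration of how the error variance controls the theoretical R-square, the following sketch evaluates $(\mathrm{Var}(Y) - \sigma^2)/\mathrm{Var}(Y)$ for a linear model $Y = \beta'X + \varepsilon$ with $\varepsilon$ independent of X, so that $\mathrm{Var}(Y) = \beta'\Sigma_X\beta + \sigma^2$. The function name `theoretical_r2` and the toy values of `beta`, `sigma_x` and `sigma` are hypothetical and are not the design used in the experiments.

```python
import numpy as np

def theoretical_r2(beta, sigma_x, sigma):
    """Theoretical R^2 = (Var(Y) - sigma^2) / Var(Y) for Y = beta'X + eps."""
    signal_var = float(beta @ sigma_x @ beta)  # Var(beta'X)
    total_var = signal_var + sigma ** 2        # Var(Y), eps independent of X
    return (total_var - sigma ** 2) / total_var

# Toy illustration (hypothetical values, not the paper's design).
beta = np.array([1.0, 0.5, 0.0])
sigma_x = np.eye(3)
for sigma in (0.2, 1.0, 2.0):
    print(sigma, round(theoretical_r2(beta, sigma_x, sigma), 4))
```

For a fixed coefficient vector and covariance, a larger error standard deviation gives a smaller value, which is the mechanism used to obtain the different R-square levels in Experiment 1.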

Experiment 1. This experiment is designed mainly for: (1) comparing the new
estimator $\hat\theta$ defined by (3.9) with the Gaussian Dantzig selector $\tilde\theta_S$; (2) examining
the effect of different choices of the theoretical $R^2$ of the full model (2.1);
(3) checking the effect of the correlation between the components of X when $R^2$ is
fixed. To achieve these goals, we compare the MSEs, the PEs and their stdMSE
and stdPE for the two estimators $\hat\theta$ and $\tilde\theta_S$, and for the two models (2.4) and
(3.2). To determine the regression coefficients in our simulation,
we decompose the coefficient vector $\beta$ into two parts, $\beta_I$ and $\beta_{-I}$, where $I$ denotes the
set of locations of the significant components of $\beta$, and let $S = |I|$ denote the number
of elements contained in $I$. Three types of $\beta_I$ are considered:
Type (I): $\beta_I = (1, 0.4, 0.3, 0.5, 0.3, 0.3, 0.3)'$ and $I = \{1, 2, 3, 4, 5, 6, 7\}$;
Type (II): $\beta_I = (1, 0.4, 0.3, 0.5, 0.3, 0.3, 0.3)'$ and $I = \{1, 17, 33, 49, 65, 81, 97\}$;
Type (III): $\beta_I = (1, 0.4, -0.3, -0.5, 0.3, 0.3, -0.3)'$ and $I = \{1, 2, 3, 4, 5, 6, 7\}$.
As it is very rare that all other coefficients are exactly zero, non-sparse models
are considered. To mimic practical scenarios, we set the values of the components
$\beta_{-I,j}$ of $\beta_{-I}$ as follows. Before performing the variable selection and estimation,
we generate the $\beta_{-I,j}$ from the uniform distribution $U(-0.5, 0.15)$, and the negative values
among them are then set to zero. After the coefficient vector $\beta$ is determined, we
treat it as a fixed vector and regard $\beta_I$ as the main part of the coefficient
vector $\beta$. We set the values of the $\beta_{-I,j}$ in this way because, in the simulations
below, there are too many insignificant variables with small or zero coefficients and it
makes little sense to give them a common value. As there are too many of these
insignificant coefficients, we do not list them here. We use $\hat I$ to denote the set of
subscripts of the coefficients $\theta$ in $\beta$, that is, the subscripts of the variables selected
into the sub-model. We assume $X \sim N_p(\mu, \Sigma_X)$, where the components of $\mu$ corresponding
to $\hat I$ are 0 and the others are 2, and the $(i,j)$-th element of $\Sigma_X$ is $\Sigma_{ij} = (-\rho)^{|i-j|}$, $0 < \rho < 1$.
Furthermore, the error term $\varepsilon$ is assumed to be normally distributed as $\varepsilon \sim N(0, \sigma^2)$.
In this experiment, we choose different $\sigma$ to obtain different types of full model with
different $R^2$. In the simulation procedure, the kernel function is chosen to be the
Gaussian kernel $K(u) = \exp\{-u^2/2\}$. In this experiment, the parameter
$\lambda_p$ in the Dantzig selector is chosen as in Candes and Tao (2007), that is, as
the empirical maximum of $|X'z|_i$ over several realizations of $z \sim N(0, I_n)$.
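The data-generating design described above can be sketched as follows. This is a minimal sketch under stated assumptions, not the authors' code: the seed, the number of draws `n_draws`, the zero mean vector (used here only for simplicity), and the helper name `empirical_lambda` are hypothetical, and the columns of X are left unscaled.

```python
import numpy as np

rng = np.random.default_rng(0)  # seed chosen arbitrarily
n, p, rho, sigma = 50, 100, 0.1, 0.2

# Main coefficients of Type (I) on I = {1,...,7} (0-based indices 0..6).
main_idx = np.arange(7)
beta = np.zeros(p)
beta[main_idx] = [1.0, 0.4, 0.3, 0.5, 0.3, 0.3, 0.3]

# Remaining coefficients: U(-0.5, 0.15) draws with negative values set to zero.
rest = np.setdiff1d(np.arange(p), main_idx)
draws = rng.uniform(-0.5, 0.15, size=rest.size)
beta[rest] = np.where(draws < 0, 0.0, draws)

# Covariance matrix with (i, j)-th element (-rho)^{|i - j|}.
idx = np.arange(p)
sigma_x = (-rho) ** np.abs(idx[:, None] - idx[None, :])

# Assumption: zero mean vector, to keep the sketch short.
X = rng.multivariate_normal(np.zeros(p), sigma_x, size=n)
y = X @ beta + sigma * rng.normal(size=n)

def empirical_lambda(X, n_draws=200, rng=rng):
    """Empirical maximum of |X'z|_i over n_draws realizations z ~ N(0, I_n)."""
    z = rng.normal(size=(X.shape[0], n_draws))
    return float(np.max(np.abs(X.T @ z)))

lam = empirical_lambda(X)
```

The value `lam` plays the role of the data-driven tuning parameter; taking the maximum over both the components and the realizations is one reading of "empirical maximum" and is an assumption of this sketch.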

The following Tables 1 and 2 report the MSEs and the corresponding PEs via
200 repetitions. In these tables, $\hat Y$ is the prediction via the adjusted model (3.2)
that is based on the full dataset, $\hat Y_S$ is the prediction via the sub-model (2.4) with
the new estimator $\hat\theta$ defined in (3.9), and $\tilde Y_S$ stands for the prediction via the sub-model
(2.4) and the Gaussian Dantzig selector $\tilde\theta_S$. For the definitions of $\hat Y$, $\hat Y_S$ and $\tilde Y_S$ see
(3.10), (3.11) and (3.12), respectively. The purpose of such a comparison is to see
whether the adjustment works and, when we must use the sub-model (2.4) because
the high-dimensional data are not available (say, too expensive to collect), whether
the new estimator $\hat\theta$ together with the sub-model (2.4) is helpful for prediction
accuracy. The sample size is 50, and for the prediction we perform the experiment
with 200 repetitions to compute the proportion $\tau$ of repetitions in which the prediction error of
$\hat Y_S$ is less than that of $\tilde Y_S$. The larger $\tau$ is, the better the
new estimator is. We have the following considerations in designing the experiment:
(a) We study models with the theoretical $R^2$ ranging between 0.3 and
1.0, which is determined by the variance $\sigma^2$ of the error term; here we
choose $\sigma$ = 0.2, 0.6, 0.9, 1.3 and 1.9, respectively. (b) The correlation between the
components of X should affect the estimation, so we consider the
correlation coefficients $\rho$ = 0.1 and 0.7.
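The proportion $\tau$ can be computed from per-repetition prediction errors as in the sketch below. The function name `proportion_tau` and the four toy error values are hypothetical; in the experiments the arrays would hold the 200 prediction errors of the two sub-model predictions.

```python
import numpy as np

def proportion_tau(pe_new, pe_dantzig):
    """Count and proportion of repetitions where the new estimator predicts better."""
    pe_new = np.asarray(pe_new, dtype=float)
    pe_dantzig = np.asarray(pe_dantzig, dtype=float)
    wins = int(np.sum(pe_new < pe_dantzig))  # strict inequality, as in the text
    return wins, wins / pe_new.size

# Toy illustration with four repetitions (hypothetical numbers).
wins, tau = proportion_tau([0.20, 0.30, 0.50, 0.40], [0.25, 0.28, 0.90, 0.41])
print(f"{wins}/4", tau)
```

Reporting the count together with the number of repetitions mirrors the "wins/200" entries in the last column of the tables.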

1. Let n = 50, p = 100, S = 7 and $\rho$ = 0.1. For each type of $\beta$, we choose
different $\sigma$ to control the theoretical $R^2$ and consider five cases.
For type (I), we have the following results:
Case 1. $R^2$ = 0.98, $I = \{1, 2, 3, 4, 5, 6, 7\}$ and $\hat I = \{1, 2, 3, 4, 6, 7\}$;
Case 2. $R^2$ = 0.82, $I = \{1, 2, 3, 4, 5, 6, 7\}$ and $\hat I = \{1, 2, 4, 6, 7, 55\}$;
Case 3. $R^2$ = 0.67, $I = \{1, 2, 3, 4, 5, 6, 7\}$ and $\hat I = \{1, 2, 3, 4, 15, 22, 28, 81\}$;
Case 4. $R^2$ = 0.50, $I = \{1, 2, 3, 4, 5, 6, 7\}$ and $\hat I = \{1, 2, 4, 27, 29, 49, 53, 84\}$;
Case 5. $R^2$ = 0.31, $I = \{1, 2, 3, 4, 5, 6, 7\}$ and $\hat I = \{1, 4, 5, 24, 25, 42, 43, 62\}$.
For type (II), we have the following results:
Case 1. $R^2$ = 0.98, $I = \{1, 17, 33, 49, 65, 81, 97\}$ and $\hat I = \{1, 17, 33, 49, 65, 81, 97\}$;
Case 2. $R^2$ = 0.84, $I = \{1, 17, 33, 49, 65, 81, 97\}$ and $\hat I = \{1, 17, 33, 43, 49, 81\}$;
Case 3. $R^2$ = 0.71, $I = \{1, 17, 33, 49, 65, 81, 97\}$ and $\hat I = \{1, 15, 17, 33, 49, 62, 72\}$;
Case 4. $R^2$ = 0.53, $I = \{1, 17, 33, 49, 65, 81, 97\}$ and $\hat I = \{1, 5, 26, 29, 33, 43, 49, 53, 65, 74\}$;
Case 5. $R^2$ = 0.35, $I = \{1, 17, 33, 49, 65, 81, 97\}$ and $\hat I = \{1, 7, 17, 26, 29, 31, 49, 72, 80, 96, 97, 98\}$.
For type (III), we have the following results:
Case 1. $R^2$ = 0.98, $I = \{1, 2, 3, 4, 5, 6, 7\}$ and $\hat I = \{1, 2, 3, 4, 5, 6\}$;
Case 2. $R^2$ = 0.83, $I = \{1, 2, 3, 4, 5, 6, 7\}$ and $\hat I = \{1, 3, 4, 5, 6, 7, 15\}$;
Case 3. $R^2$ = 0.69, $I = \{1, 2, 3, 4, 5, 6, 7\}$ and $\hat I = \{1, 2, 4, 5, 7, 92\}$;
Case 4. $R^2$ = 0.51, $I = \{1, 2, 3, 4, 5, 6, 7\}$ and $\hat I = \{1, 5, 7, 8, 67, 71\}$;
Case 5. $R^2$ = 0.33, $I = \{1, 2, 3, 4, 5, 6, 7\}$ and $\hat I = \{1, 4, 6, 7, 21, 23, 38, 50, 75, 83\}$.

Table 1. MSE, PE and their standard errors with n = 50, p = 100, S = 7 and $\rho$ = 0.1

type   $R^2$   MSE(stdMSE)                        PE(stdPE)                                           $\tau$
               $\hat\theta$     $\tilde\theta_S$   $\hat Y$         $\hat Y_S$       $\tilde Y_S$
(I)    0.98    0.0032(0.0118)   0.0866(0.3519)   0.1630(0.0405)   0.2299(0.0535)   1.1587(0.5549)   200/200
       0.82    0.0134(0.0544)   0.1197(0.1654)   0.6603(0.1497)   0.7249(0.1564)   1.4755(0.3475)   200/200
       0.67    0.0273(0.1288)   0.0430(0.1283)   1.3038(0.2952)   1.3438(0.3018)   1.4821(0.3266)   166/200
       0.50    0.0543(0.2387)   0.0694(0.2221)   2.5371(0.5500)   2.5919(0.5633)   2.7176(0.6020)   142/200
       0.31    0.1028(0.4689)   0.1131(0.4876)   4.9199(1.1856)   4.9960(1.2070)   5.0708(1.1965)   126/200
(II)   0.98    0.0052(0.0202)   0.3540(1.4263)   0.2584(0.0569)   0.2744(0.0583)   1.1324(2.4262)   200/200
       0.84    0.0162(0.0686)   0.4087(0.3730)   0.8310(0.1823)   0.8417(0.1834)   3.7996(0.7909)   200/200
       0.70    0.0292(0.1112)   0.1770(0.2559)   1.4761(0.3028)   1.4727(0.3018)   2.6389(0.5804)   199/200
       0.53    0.0588(0.3024)   0.0942(0.2988)   2.8825(0.6534)   2.8700(0.6460)   3.2707(0.6758)   171/200
       0.35    0.1107(0.6896)   0.1251(0.6368)   5.4055(1.1809)   5.3896(1.1856)   5.6004(1.2280)   141/200
(III)  0.98    0.0028(0.0113)   0.0879(0.2938)   0.1643(0.0410)   0.2365(0.0537)   1.2282(0.5590)   200/200
       0.83    0.0114(0.0531)   0.0873(0.1589)   0.5874(0.1332)   0.6938(0.1533)   1.3483(0.3118)   200/200
       0.69    0.0234(0.0934)   0.1294(0.1667)   1.1922(0.2857)   1.2445(0.2961)   1.9950(0.4379)   196/200
       0.51    0.0529(0.1715)   0.0913(0.1775)   2.6373(0.5788)   2.7418(0.6098)   2.9601(0.6288)   164/200
       0.33    0.1006(0.5013)   0.1083(0.5158)   5.0952(1.2099)   5.1720(1.2241)   5.2372(1.2594)   119/200



The simulation results are reported in Table 1. The results suggest that the
adjustment of (3.2) works very well: the corresponding estimation and prediction
are the best among the competitors in almost all cases. Further, as we mentioned, when
the full dataset is not available and we thus use the sub-model (2.4), the new
estimator $\hat\theta$ is also useful for prediction. It can be seen that $\hat Y_S$ is better than $\tilde Y_S$,
and the value of $\tau$ is larger than 0.7 in 13 of the 15 cases, while in the other 2
cases it is larger than or about 0.6.
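The counts quoted above can be checked directly from the last column of Table 1. The short sketch below copies those entries; the dictionary layout and variable names are chosen here for illustration only.

```python
# Entries of the last column of Table 1 (wins out of 200 repetitions), by type.
wins = {
    "I": [200, 200, 166, 142, 126],
    "II": [200, 200, 199, 171, 141],
    "III": [200, 200, 196, 164, 119],
}
reps = 200

taus = [w / reps for counts in wins.values() for w in counts]
above_07 = sum(t > 0.7 for t in taus)   # number of cases with tau > 0.7
smallest = min(taus)                    # smallest proportion over all cases

print(above_07, len(taus), smallest)
```

The two cases not exceeding 0.7 are the lowest R-square settings of types (I) and (III), with proportions 126/200 and 119/200.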

2. To provide more information, we also consider the case with higher correlation
$\rho$ = 0.7: n = 50, p = 100, S = 7. Again, different values of $\sigma$ are chosen to control the
theoretical $R^2$.
For type (I), we consider the following five cases.
Case 1. $R^2$ = 0.96, $I = \{1, 2, 3, 4, 5, 6, 7\}$ and $\hat I = \{1, 2, 4, 5, 6, 7\}$;
Case 2. $R^2$ = 0.71, $I = \{1, 2, 3, 4, 5, 6, 7\}$ and $\hat I = \{1, 2, 4, 81\}$;
Case 3. $R^2$ = 0.53, $I = \{1, 2, 3, 4, 5, 6, 7\}$ and $\hat I = \{1, 4, 8, 9\}$;
Case 4. $R^2$ = 0.35, $I = \{1, 2, 3, 4, 5, 6, 7\}$ and $\hat I = \{1, 4, 8, 51\}$;
Case 5. $R^2$ = 0.20, $I = \{1, 2, 3, 4, 5, 6, 7\}$ and $\hat I = \{1, 2, 6, 84\}$.
For type (II), we consider the following five cases.
Case 1. $R^2$ = 0.98, $I = \{1, 17, 33, 49, 65, 81, 97\}$ and $\hat I = \{1, 17, 33, 49, 65, 97\}$;
Case 2. $R^2$ = 0.84, $I = \{1, 17, 33, 49, 65, 81, 97\}$ and $\hat I = \{1, 18, 49, 65, 97\}$;
Case 3. $R^2$ = 0.69, $I = \{1, 17, 33, 49, 65, 81, 97\}$ and $\hat I = \{1, 2, 49, 52, 65\}$;
Case 4. $R^2$ = 0.52, $I = \{1, 17, 33, 49, 65, 81, 97\}$ and $\hat I = \{1, 15, 33, 49, 76, 84, 98\}$;
Case 5. $R^2$ = 0.34, $I = \{1, 17, 33, 49, 65, 81, 97\}$ and $\hat I = \{1, 2, 24, 48, 49, 55, 87, 97\}$.
For type (III), we consider the following five cases.
Case 1. $R^2$ = 0.96, $I = \{1, 2, 3, 4, 5, 6, 7\}$ and $\hat I = \{1, 2, 4, 6, 7\}$;
Case 2. $R^2$ = 0.74, $I = \{1, 2, 3, 4, 5, 6, 7\}$ and $\hat I = \{1, 4, 6, 7\}$;
Case 3. $R^2$ = 0.56, $I = \{1, 2, 3, 4, 5, 6, 7\}$ and $\hat I = \{1, 6, 7, 33, 56\}$;
Case 4. $R^2$ = 0.38, $I = \{1, 2, 3, 4, 5, 6, 7\}$ and $\hat I = \{1, 4, 7, 51, 93\}$;
Case 5. $R^2$ = 0.23, $I = \{1, 2, 3, 4, 5, 6, 7\}$ and $\hat I = \{1, 2, 7, 31, 45, 80, 85, 88\}$.

Table 2. MSE, PE and their standard errors with n = 50, p = 100, S = 7 and $\rho$ = 0.7

type   $R^2$   MSE(stdMSE)                        PE(stdPE)                                           $\tau$
               $\hat\theta$     $\tilde\theta_S$   $\hat Y$         $\hat Y_S$       $\tilde Y_S$
(I)    0.96    0.0136(0.0504)   0.3285(0.4226)   0.2472(0.0517)   0.2706(0.0599)   1.7397(0.3804)   200/200
       0.71    0.0253(0.1426)   0.0709(0.2401)   0.6530(0.1463)   0.6945(0.1557)   1.9892(0.2070)   197/200
       0.53    0.0373(0.1621)   0.1108(0.2310)   1.2779(0.2744)   1.3235(0.2861)   1.5985(0.3736)   177/200
       0.35    0.0613(0.3122)   0.0999(0.3289)   2.3431(0.5342)   2.3694(0.5395)   2.6339(0.5799)   161/200
       0.2     0.1198(0.6479)   0.1292(0.6619)   5.1184(1.2643)   5.1347(1.2729)   5.1764(1.2420)   129/200
(II)   0.98    0.0122(0.0484)   0.2730(0.3789)   0.2648(0.0730)   0.2809(0.0757)   1.1952(0.2440)   200/200
       0.84    0.0201(0.0924)   0.1799(0.2037)   0.6567(0.1453)   0.6580(0.1452)   1.6477(0.3560)   200/200
       0.69    0.0303(0.1338)   0.2899(0.4442)   1.2955(0.2992)   1.2996(0.3047)   2.7125(0.5861)   200/200
       0.52    0.0644(0.3395)   0.11411(0.4388)  2.5572(0.5558)   2.5633(0.5582)   3.2790(0.6834)   191/200
       0.34    0.1245(0.5615)   0.1831(0.6787)   5.0731(1.1850)   5.0818(1.1743)   5.5988(1.2782)   161/200
(III)  0.96    0.0239(0.0626)   0.6020(2.1653)   0.2596(0.0560)   0.2897(0.0630)   1.6754(1.4970)   200/200
       0.74    0.0315(0.1158)   0.4401(0.5248)   0.6435(0.1435)   0.6485(0.1442)   2.7859(0.6035)   200/200
       0.56    0.0749(0.2373)   0.1736(0.2679)   1.3334(0.2947)   1.4367(0.3217)   1.8643(0.3965)   189/200
       0.38    0.0687(0.3227)   0.1701(0.3809)   2.3637(0.4538)   2.4645(0.4818)   2.9415(0.5992)   178/200
       0.23    0.1740(0.8078)   0.2446(0.8718)   4.8488(1.1812)   4.8887(1.1968)   5.1471(1.1499)   145/200




Table 2 shows that when $\rho$ is larger, the conclusions about the comparison are
almost identical to those presented in Table 1. Thus, whether $\rho$
is large or not, the new method works well in all the cases considered.

We are now in a position to make another comparison. In Experiments 2 and
3 below, we do not use the data-driven approach of Experiment 1 to select
$\lambda_p$; instead, we manually select several values to see whether our method works.
This is because in these two experiments our goal is not to study the shrinkage tuning
parameter, but to see whether the new method works once a
sub-model is given.

Experiment 2. In this experiment, our focus is on how the correlation between
variables affects the estimation. The distribution of X is the same as that in
Experiment 1 except for the dimension. The coefficient vector $\beta$ is designed as
type (I) in Experiment 1. Furthermore, the error term $\varepsilon$ is assumed to be normally
distributed as $\varepsilon \sim N(0, 0.2^2)$.

As different choices of $\lambda_p$ usually lead to different sub-models, or equivalently
to different estimators $\hat I$ of $I$, we are able to examine the performance of the new
estimator, by MSE and PE, when the numbers of significant variables included in the
sub-models are different. In this experiment, we take n = 50, p = 100, S = 7 and
$\rho$ = 0.1, 0.3, 0.5, 0.7, and for each $\rho$ we consider two cases with two values of $\lambda_p$:
$\rho$ = 0.1:
Case 1. $\lambda_p$ = 3.97, $I = \{1, 2, 3, 4, 5, 6, 7\}$, $\hat I = \{1, 2, 3, 4, 5, 6, 7\}$;
Case 2. $\lambda_p$ = 6.53, $I = \{1, 2, 3, 4, 5, 6, 7\}$, $\hat I = \{1, 3, 4, 6, 95\}$.
$\rho$ = 0.3:
Case 1. $\lambda_p$ = 3.32, $I = \{1, 2, 3, 4, 5, 6, 7\}$, $\hat I = \{1, 2, 3, 4, 5, 6\}$;
Case 2. $\lambda_p$ = 6.77, $I = \{1, 2, 3, 4, 5, 6, 7\}$, $\hat I = \{1, 2, 4, 6, 23\}$.
$\rho$ = 0.5:
Case 1. $\lambda_p$ = 3.72, $I = \{1, 2, 3, 4, 5, 6, 7\}$, $\hat I = \{1, 2, 4, 5, 6, 7\}$;
Case 2. $\lambda_p$ = 7.29, $I = \{1, 2, 3, 4, 5, 6, 7\}$, $\hat I = \{1, 4, 5, 7, 41, 58, 72\}$.
$\rho$ = 0.7:
Case 1. $\lambda_p$ = 3.50, $I = \{1, 2, 3, 4, 5, 6, 7\}$, $\hat I = \{1, 3, 4, 7, 41, 75\}$;
Case 2. $\lambda_p$ = 7.22, $I = \{1, 2, 3, 4, 5, 6, 7\}$, $\hat I = \{1, 4, 7, 51, 64, 67, 68, 83\}$.

Table 3. MSE, PE and their standard errors with n = 50, p = 100, S = 7

$\rho$   Case   MSE(stdMSE)                        PE(stdPE)                                           $\tau$
                $\hat\theta$     $\tilde\theta_S$   $\hat Y$         $\hat Y_S$       $\tilde Y_S$
0.1      1      0.0052(0.0242)   0.2929(0.3877)   0.2580(0.0528)   0.2612(0.0527)   3.0195(0.6691)   200/200
         2      0.0104(0.0357)   0.2347(0.1784)   0.5135(0.1074)   0.6430(0.1282)   5.921(0.4172)    200/200
0.3      1      0.0070(0.0289)   0.4067(1.6692)   0.2732(0.0590)   0.3324(0.0735)   5.6406(1.8289)   200/200
         2      0.0163(0.0458)   0.5048(0.4107)   0.4048(0.0881)   0.5014(0.1078)   6.4471(0.7697)   200/200
0.5      1      0.0079(0.0336)   0.4826(1.9425)   0.2436(0.0551)   0.3053(0.0674)   5.8204(1.8152)   200/200
         2      0.0136(0.0512)   0.1532(0.1835)   0.3655(0.0841)   0.4245(0.0914)   6.4357(0.3262)   200/200
0.7      1      0.0157(0.0602)   0.2296(0.2970)   0.2688(0.0580)   0.3198(0.0711)   6.6313(0.3560)   200/200
         2      0.0149(0.0637)   0.1914(0.1420)   0.2974(0.0624)   0.3225(0.0672)   7.5435(0.1169)   197/200




From Table 3, we can see clearly that the correlation has an impact on the
performance of the variable selection methods: the estimation gets worse with larger $\rho$.
However, the new method uniformly works much better than the Gaussian Dantzig
selector when we compare the performance of the methods with different values of
$\lambda_p$ and thus with different sub-models. We can see that in case 1 the sub-models
are more accurate than those in case 2, in the sense that they contain more of the
significant variables we want to select. Then the estimation based on the Gaussian
Dantzig selector works better, and so does the new method. Note that $\tau$ is about
1, meaning that in almost all of the 200 repetitions the prediction error of $\hat Y_S$ is smaller than that of $\tilde Y_S$.

In the following, we consider ultra high-dimensional data. 

Experiment 3. For very large p, the Dantzig selector alone cannot work
well. Thus, we use sure independence screening (SIS, Fan and Lv 2008) to reduce
the number of variables to a moderate scale that is below the sample size, and then
perform the variable selection and parameter estimation by the Gaussian
Dantzig selector and our adjustment method.
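The screening step can be sketched as follows: rank the covariates by the absolute marginal correlation with the response and keep the top d, with d below the sample size. This is a minimal sketch of the general idea of SIS, not the code used for the experiments; the function name `sis_screen`, the choice d = n - 1, the seed and the toy data are assumptions made for illustration.

```python
import numpy as np

def sis_screen(X, y, d):
    """Return sorted indices of the d covariates with largest |marginal correlation|."""
    Xs = (X - X.mean(axis=0)) / X.std(axis=0)  # standardize each column
    ys = y - y.mean()
    omega = np.abs(Xs.T @ ys)                  # componentwise marginal utilities
    keep = np.argsort(omega)[::-1][:d]         # indices of the d largest utilities
    return np.sort(keep)

# Toy illustration (hypothetical data, not the paper's design).
rng = np.random.default_rng(1)
n, p = 100, 1000
X = rng.normal(size=(n, p))
beta = np.zeros(p)
beta[:3] = [3.0, -3.0, 3.0]
y = X @ beta + 0.5 * rng.normal(size=n)

kept = sis_screen(X, y, d=n - 1)
print(kept.size, kept[:5])
```

After screening, the Dantzig selector and the adjustment method would be applied to the columns `X[:, kept]` only.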

We first consider n = 100, p = 1000 and S = 10 with $\rho$ = 0.1, 0.5 and 0.9, respectively,
and for each $\rho$ two values of $\lambda_p$ are used to obtain two $\hat I$ as follows.

For $\rho$ = 0.1, $\beta_I = (1.0, -1.5, 2.0, 1.1, -3.0, 1.2, 1.8, -2.5, -2.0, 1.0)'$, consider two cases:
Case 1. $\lambda_p$ = 4.50, $I = \{1, 2, 3, 4, 5, 6, 7, 8, 9, 10\}$, $\hat I = \{1, 3, 5, 6, 7, 8, 9, 318, 514, 723, 760\}$;
Case 2. $\lambda_p$ = 7.30, $I = \{1, 2, 3, 4, 5, 6, 7, 8, 9, 10\}$, $\hat I = \{2, 3, 5, 8, 515, 886\}$.

For $\rho$ = 0.5, $\beta_I = (1.0, -1.5, 2.0, 1.1, -3.0, 1.2, 1.8, -2.5, -2.0, 1.0)'$, consider two cases:
Case 1. $\lambda_p$ = 3.56, $I = \{1, 2, 3, 4, 5, 6, 7, 8, 9, 10\}$, $\hat I = \{1, 2, 5, 7, 8, 9, 846, 878, 976\}$;
Case 2. $\lambda_p$ = 6.92, $I = \{1, 2, 3, 4, 5, 6, 7, 8, 9, 10\}$, $\hat I = \{2, 3, 5, 8, 10, 882, 963\}$.

For $\rho$ = 0.9, $\beta_I = (1.0, -1.5, 2.0, 1.1, -3.0, 1.2, 1.8, -2.5, -2.0, 1.0)'$, consider two cases:
Case 1. $\lambda_p$ = 1.80, $I = \{1, 2, 3, 4, 5, 6, 7, 8, 9, 10\}$, $\hat I = \{3, 5, 8, 10, 415, 432\}$;
Case 2. $\lambda_p$ = 5.83, $I = \{1, 2, 3, 4, 5, 6, 7, 8, 9, 10\}$, $\hat I = \{2, 3, 5, 114, 121, 839, 853, 882, 984\}$.

With this design, the $\lambda_p$ in case 1 results in more significant variables being
selected into the sub-model than in case 2, so that we can see the performance
of the adjustment method.
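The statement that case 1 captures more significant variables than case 2 can be checked by counting the overlap between the true index set and each selected set listed above. The dictionary keys (correlation, case) and variable names below are chosen here for illustration.

```python
# True set of significant variables in the first design of Experiment 3.
true_set = set(range(1, 11))

# Selected sets copied from the cases listed above, keyed by (rho, case).
selected = {
    (0.1, 1): {1, 3, 5, 6, 7, 8, 9, 318, 514, 723, 760},
    (0.1, 2): {2, 3, 5, 8, 515, 886},
    (0.5, 1): {1, 2, 5, 7, 8, 9, 846, 878, 976},
    (0.5, 2): {2, 3, 5, 8, 10, 882, 963},
    (0.9, 1): {3, 5, 8, 10, 415, 432},
    (0.9, 2): {2, 3, 5, 114, 121, 839, 853, 882, 984},
}

# Number of truly significant variables captured by each sub-model.
hits = {key: len(true_set & sel) for key, sel in selected.items()}
for key in sorted(hits):
    print(key, hits[key])
```

For each correlation level the count in case 1 exceeds the count in case 2 (7 against 4, 6 against 5, and 4 against 3).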



Table 4. MSE, PE and their standard errors with n = 100, p = 1000, S = 10

$\rho$   Case   MSE(stdMSE)                          PE(stdPE)                                                $\tau$
                $\hat\theta$     $\tilde\theta_S$     $\hat Y$          $\hat Y_S$        $\tilde Y_S$
0.1      1      0.7588(0.3497)   71.4031(7.5501)    6.8104(1.5485)    8.0107(1.6574)    94.7515(19.2968)    200/200
         2      0.8523(0.5343)   122.8426(15.0952)  13.1274(2.7772)   16.0812(3.4160)   189.7134(34.8081)   200/200
0.5      1      3.6170(1.1823)   104.8420(13.5089)  9.9151(1.9902)    11.2352(2.2316)   133.4762(26.5058)   200/200
         2      3.4771(1.2683)   92.3485(12.5122)   11.6643(2.6704)   12.7811(2.8941)   134.3821(24.4896)   200/200
0.9      1      5.9027(2.7039)   107.6118(23.4383)  8.2842(1.6181)    11.3518(2.1745)   148.3143(27.4828)   200/200
         2      3.8963(2.1760)   59.1525(11.3152)   10.8033(2.1411)   12.9395(2.4835)   68.7272(13.4061)    200/200



From Table 4, we can see that the SIS does work to reduce the dimension so
that the Gaussian Dantzig selector and our method can be performed. Whether the
correlation coefficient is small or large ($\rho$ ranges from 0.1 to 0.9), the
new method works better than the Gaussian Dantzig selector. The conclusions are
almost identical to those when p is much smaller in Experiments 1 and 2, so
we do not give more comments here. Further, comparing the results of case 1
and case 2, we can see that the adjustment works better when the sub-model is
not well selected. The value of $\tau$ is 1 in all cases.

In the following we check the effect of model size when the dimension is larger.
In doing so, we choose n = 150, p = 2000, $\rho$ = 0.3 with S = 5, 10. For each S we
choose two values of $\lambda_p$ to obtain two $\hat I$.

For S = 5, $\beta_I = (4.0, -1.5, 6.0, -2.1, -3.0)'$, we consider two cases:
Case 1. $\lambda_p$ = 3.45, $I = \{1, 2, 3, 4, 5\}$, $\hat I = \{1, 2, 3, 4, 5, 15, 1099, 1733\}$;
Case 2. $\lambda_p$ = 8.36, $I = \{1, 2, 3, 4, 5\}$, $\hat I = \{1, 3, 554, 908\}$.

For S = 10, $\beta_I = (4.0, -1.5, 6.0, -2.1, -3.0, 1.2, 3.8, -2.5, -2.0, 7.0)'$, we consider two
cases:
Case 1. $\lambda_p$ = 3.02, $I = \{1, 2, 3, 4, 5, 6, 7, 8, 9, 10\}$, $\hat I = \{1, 2, 3, 5, 7, 8, 9, 10, 1701\}$;
Case 2. $\lambda_p$ = 9.08, $I = \{1, 2, 3, 4, 5, 6, 7, 8, 9, 10\}$, $\hat I = \{1, 3, 5, 7, 8\}$.



Table 5. MSE, PE and their standard errors with n = 150, p = 2000, $\rho$ = 0.3

S    Case   MSE(stdMSE)                          PE(stdPE)                                                  $\tau$
            $\hat\theta$     $\tilde\theta_S$     $\hat Y$          $\hat Y_S$        $\tilde Y_S$
5    1      0.4245(0.2102)   262.6392(21.2109)  6.4015(1.3038)    6.3439(1.2879)    322.9945(62.6228)     200/200
     2      1.9510(1.0923)   359.5838(32.4150)  24.1959(4.8932)   24.8013(5.1629)   559.3584(98.1216)     200/200
10   1      0.8799(0.5108)   498.7862(59.0383)  10.6009(2.3903)   12.3505(2.6381)   946.3400(175.1009)    200/200
     2      1.8524(0.7599)   68.1862(43.3612)   15.0471(2.8069)   16.9161(3.1755)   1623.4936(111.5972)   200/200




The results in Table 5 show that the SIS is again useful for reducing the dimension
so that the Gaussian Dantzig selector and our method can be used. When the model size
is smaller, estimation accuracy is better, with smaller MSE and PE. In other
words, when the model size is smaller, variable selection performs better and the
sub-model is more accurate (case 1 with S = 5); then the adjustment method does
not help much. In contrast, it is useful for improving estimation accuracy
when the sub-model is very different from the full model.

In summary, the results in the five tables above clearly show the superiority of
the new estimator $\hat\theta$ and the adjusted model (3.2)/the sub-model (2.4) over the
others, in the sense of smaller MSEs, PEs and standard errors, and a large proportion
$\tau$. The good performance holds for different combinations of the sizes of the selected
sub-models (values of $\lambda_p$), n, p, S, I, and the correlation between the
components of X. The new method is particularly useful when a sub-model, as a working
model, is very different from the underlying true model. Thus, the adjustment method
is well worth recommending. However, as a trade-off, the adjustment method
involves nonparametric estimation, although a low-dimensional one, and it might not be
that helpful when the sub-model is accurate enough. Thus, we may consider using
it after checking whether the sub-model is significantly biased. The relevant research
is ongoing.



Appendix 

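Proof of (3.4) Note that

$$
\begin{aligned}
E(\xi(V)\mid Z,V) &= E(Y-\theta'Z-g(V)\mid Z,V)\\
&= E(Y-\theta'Z\mid Z,V)-E(g(V)\mid Z,V)\\
&= E(\gamma'U+\varepsilon\mid Z,V)-E\big(E(\gamma'U+\varepsilon\mid V)\mid Z,V\big)\\
&= \gamma'E(U\mid Z,V)-\gamma'E(U\mid V)\\
&= \gamma'E(U\mid Z,\alpha'U/\rho,W)-\gamma'E(U\mid \alpha'U/\rho,W).
\end{aligned}
$$

Further,

$$
\begin{aligned}
&E(\gamma'U/\rho\mid Z=z,\ \alpha'U/\rho=t,\ W=w)\\
&\quad= E\Big(\gamma'U/\rho\ \Big|\ Z=z,\ U^{(1)}=\big(t\rho-\textstyle\sum_{j\ge 2}U^{(j)}\alpha_j\big)/\alpha_1,\ W=w\Big)\\
&\quad= E\Big(\big(\gamma_1\big(t\rho-\textstyle\sum_{j\ge 2}U^{(j)}\alpha_j\big)/\alpha_1+\textstyle\sum_{j\ge 2}U^{(j)}\gamma_j\big)/\rho\ \Big|\ Z=z,\ W=w\Big)\\
&\quad= E\Big(\gamma_1t/\alpha_1+\textstyle\sum_{j\ge 2}U^{(j)}(\gamma_j-\gamma_1\alpha_j/\alpha_1)/\rho\ \Big|\ Z=z,\ W=w\Big)\\
&\quad= \gamma_1t/\alpha_1+\lambda'E(U\mid Z=z,\ W=w).
\end{aligned}
$$

Similarly, $E(\gamma'U/\rho\mid W=w,\ \alpha'U/\rho=t) = \gamma_1t/\alpha_1+\lambda'E(U\mid W=w)$. Combining the
above results leads to

$$E(\xi(V)\mid Z,V) = \rho\lambda'\big(E(U\mid Z,W)-E(U\mid W)\big),$$

as required. □

Proof of Theorem 3.1 The proof follows directly from the unbiasedness of the
model (3.2) for any $\alpha$ and Theorem 2.1.2 of Härdle et al. (2000). □

Proof of Theorem 3.2 Denote V* = {Y'U/p*,W'y and p* = OiWYh^V^i),
where 7* is a /-dimensional vector between and a. Then there exists a vector 7*
such that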



k=l 



k=l k=l 



where Lh{-) is the derivative of Lh{-)- By the conditions (CI) and (C5), we have 



and, consequently. 



k=l 
By standard nonparametric techniques (see Härdle et al., 2000), it is easy to have
-n t LniV, -v)- fy{v) = O, [h' + , 

and then 

^ E Lnin - v) - fv{v) = 0,{h' + ^^^) + 0,{n-^), 
where fv is the density function of V. Similarly, we can prove 

- ZkLniVk -v)- zfzyiz, v)dz = oJh' + _-_ ) + Op{n-'^), 



where fzy is the joint density function of {Z, V). Combining the results above leads 
to Z = Z - E{Z\V) + Op(^h'' + + Op{n-^') and, consequently. 



Further, by the definition of 6 and the above result, we have 



i=l 

where 



9iv^ = 



1 v^ri 



Again by the conditions (CI) and (C4), we have 
Note that, under the condition (C4),

Therefore, combining the above results completes the proof of the theorem. □

Proof of Corollary 3.3 The proof follows directly from the result of Theorem 3.2
and Theorem 2.12 of Härdle et al. (2000). □